Data Set Overview

The data that we have is from Kaggle and it shows the car features for each model of car from 1990 to 2017, which includes most of the car brands in the market such as “Audi”, “BMW”, “Tesla”, “Toyota”, “Nissan” etc. Also includes most of the models for each brand. The year column indicates the year that the model was produced. Engine fuel type is categorical data indicating the type of fuel the engine is required, the levels include “diesel”, “electric”, “premium unleaded (required)”, “regular unleaded” and so on, a total of 11 types of fuels. Engine horsepower indicates how powerful the car engine is, which is an important element in our next analysis. The engine cylinder shows the number of cylinders the engine has. Transmission type, driven wheels and number of doors are pretty straightforward. The market category shows how the automobile market defines the specific model, which can include tags such as “High-Performance”, “Luxury”, “Hybrid” etc. can also mix and match any number of tags. Vehicle size is measured by the total volume of the car, it has three levels “Compact”, “Large” and “Midsize”. Vehicle style shows whether the car is a regular sedan, coupe, cargo van or wagon, it has a total of 16 levels. Highway MPG and city MPG are measurements to show how fuel efficiency that car is. The popularity index shows how well the car was sold. Finally, the MSRP indicates the manufacturer’s suggested retail price of the car. As we can see, the dataset has a total of 11914 rows and 16 columns, which is sufficient for our later analysis.

data <- read.csv("datafile.csv", header = T)
# Remove empty rows and duplicate rows
car <- data[complete.cases(data), ]
car <- car[!duplicated(car), ]
# Remove outliers
Q1 <- quantile(car$MSRP, .25)
Q3 <- quantile(car$MSRP, .75)
IQR <- IQR(car$MSRP)
car <- subset(car, car$MSRP > (Q1 - 1.5*IQR) & car$MSRP < (Q3 + 1.5*IQR))
# Remove unused columns and keep only the following columns
columns <- c("Make",
             "Year",
             "Engine.HP", 
             "Engine.Cylinders", 
             "highway.MPG",
             "city.mpg",
             "Popularity",
             "MSRP")
car <- car[, columns]
attach(car)
head(car)
##   Make Year Engine.HP Engine.Cylinders highway.MPG city.mpg Popularity  MSRP
## 1  BMW 2011       335                6          26       19       3916 46135
## 2  BMW 2011       300                6          28       19       3916 40650
## 3  BMW 2011       300                6          28       20       3916 36350
## 4  BMW 2011       230                6          28       18       3916 29450
## 5  BMW 2011       230                6          28       18       3916 34500
## 6  BMW 2012       230                6          28       18       3916 31200

Goal of Analysis

The price of a car is the key that most people care about. So, the goal of this project is to analyze the relationships between each feature of the car and the manufacturer’s suggested retail price of the car. Which feature is highly correlated with the price and which one is not. With the correlation, we can even predict the price of a car with some key features. Since there is some feature that is not needed, such as the model which is an unknown factor of the car, each make can have a large variance between all the models. So, the features that we care about are “Make”, “Year”, “Engine horsepower”, “Engine cylinders”, “highway.MPG”, “city.mpg” and “Popularity”. We are going to find out which of these features can have the highest impact on the price of the car.

MSRP Distribution

For the MSRP distribution we have here, it is clear that the highest appearance of the price range is between 2000 – 2999, 1137 cars are filling in this range. Besides that part, the rest of the data appears in a relatively normal shape, with a right tail, that is because there are some luxury and performance cars have a high retail price, and the price range for most of the daily used cars is between 20,600 and 39,000 with a mean of 29,500 and a median of 29,000. The minimum car retail price is 2,000 dollars, and the maximum is 75,200. This is because we removed the outliers, with the outlier the maximum price could exceed 2 million. This makes sense, as we expected most of the cars in the market today are between 20,000 to 40,000 US dollars and some luxury or performance cars have an extremely high price even above a million. And the minimum price we obtained is economy cars with production year before 2000.

library(plotly)
plot_ly(car, x = ~MSRP, type="histogram", name = 'MSRP')

Central Limit Theorem

The central limit theorem in probability theory states that the distribution of a variable relative to its size increases as its sample size increases. This is done so assuming that all samples are the same size. CLT states that if large sample size is obtained, the mean of all the variables from that population will be roughly equal to that of the whole population, it also states that even though the distribution of the data is abnormal, the mean of the data is still within the mean of the population. In other words, if the sample size gets larger, the variance of the variables will also be equal to that of the population. In our case, we chose to use 4 different sample sizes, 10, 20, 30 and 40 which are randomly picked in the dataset. In our case, the population mean is 29,466.32 and the standard deviation is 15909.84. As we can see in the graph, the mean of each sample size is close to the population mean but as the sample size goes up, the standard deviation is getting smaller and smaller, and the distribution is getting even more narrow. This matches the central limit theorem, as the sample size gets larger, the shape of the sampling distribution becomes relatively normal.

means1 = rep(NA, 5000)
set.seed(544)
for(i in 1:5000){
  means1[i] = mean(sample(car$MSRP, size = 10))
}

means2 = rep(NA, 5000)
set.seed(544)
for(i in 1:5000){
  means2[i] = mean(sample(car$MSRP, size = 20))
}

means3 = rep(NA, 5000)
set.seed(544)
for(i in 1:5000){
  means3[i] = mean(sample(car$MSRP, size = 30))
}

means4 = rep(NA, 5000)
set.seed(544)
for(i in 1:5000){
  means4[i] = mean(sample(car$MSRP, size = 40))
}
cat(paste0("MSRP mean = ",round(mean(car$MSRP),2), ", SD = ", round(sd(car$MSRP),2),"\n"),
paste0("Sample size = 10, mean = ",round(mean(means1),2), ", SD = ", round(sd(means1),2),"\n"),
paste0("Sample size = 20, mean = ",round(mean(means2),2), ", SD = ", round(sd(means2),2),"\n"),
paste0("Sample size = 30, mean = ",round(mean(means3),2), ", SD = ", round(sd(means3),2),"\n"),
paste0("Sample size = 40, mean = ",round(mean(means4),2), ", SD = ", round(sd(means4),2)),sep = "")
## MSRP mean = 29466.32, SD = 15909.84
## Sample size = 10, mean = 29470.11, SD = 5001.86
## Sample size = 20, mean = 29451.15, SD = 3545.45
## Sample size = 30, mean = 29416.12, SD = 2914.68
## Sample size = 40, mean = 29441.33, SD = 2483.19
subplot(plot_ly(x = means1, type="histogram", name = 'Sample size = 10'),
plot_ly(x = means2, type="histogram", name = 'Sample size = 20'),
plot_ly(x = means3, type="histogram", name = 'Sample size = 30'),
plot_ly(x = means4, type="histogram", name = 'Sample size = 40'),nrows = 2)

General Analysis of MSRP

MSRP for different Make

The following table is a summary of MSRP based on different car makes. The highest average MSRP make is the well-known luxury brand, Maserati. The average retail price of Maserati is 71,000 dollars. We can see that it also has the highest minimum price, which means the variance of Maserati is small, this indicates that for most Maserati models, the price range is between 70,000 and 72,000 dollars. While Cadillac tells us a different story, as we can see in the table, although Cadillac has the fourth-highest average retail price, it has a relatively large variance as the minimum price for Cadillac is only 2000 dollars, and the highest Cadillac model has a retail price of 58,000 dollars. This can be caused by 2 reasons, first reason could be the dataset of Cadillac contains some extremely old models, produced before the year 2000. This makes sense as we mentioned earlier, the car price has a huge gap before and after the year 2000. Another reason is that there is a huge gap in retail prices between models in Cadillac. So, we did some inspection of the Cadillac to find out the year of the model whose retail price is below 1.5 IQR of the first quantile. The result shows our first guess is correct, that all the models that have a low retail price are produced before the year 2000, which means that all models made after the year 2000 have a relatively reasonable variance. When we look at the boxplot for the make vs. MSRP, we found that Mercedes, Porsche, SAAB and Volvo, these brands have a ridiculous interquartile range, which means the middle 50% of data points of these brands are widely spread, so that the price of each model have relatively big differences.

library(dplyr)
a <- aggregate(MSRP~Make, data = car, summary)
a %>% arrange(desc(MSRP[,"Mean"]))
##             Make MSRP.Min. MSRP.1st Qu. MSRP.Median MSRP.Mean MSRP.3rd Qu.
## 1       Maserati 69800.000    70400.000   71100.000 71000.000    71700.000
## 2          Lotus 43995.000    54990.000   65595.000 61717.750    68412.500
## 3     Alfa Romeo 53900.000    55900.000   63900.000 61600.000    65900.000
## 4       Cadillac  2000.000    44190.000   48777.500 49089.994    57595.000
## 5            BMW  4697.000    40425.000   46450.000 47150.061    56950.000
## 6        Genesis 41400.000    42650.000   43900.000 46616.667    49225.000
## 7     Land Rover  2561.000    37110.000   45570.000 45715.220    54950.000
## 8        Porsche  2667.000     7419.000   53900.000 45383.244    63900.000
## 9        Lincoln  2000.000    38301.250   42095.000 43118.847    47482.500
## 10      Infiniti  2000.000    35925.000   43300.000 42498.957    50400.000
## 11         Lexus  2000.000    36517.500   41900.000 41755.858    50417.500
## 12          Audi  2000.000    36850.000   43175.000 41593.393    53100.000
## 13 Mercedes-Benz  2000.000     4213.000   43325.000 37545.836    56037.500
## 14        HUMMER 30750.000    33390.000   36015.000 36464.412    39290.000
## 15         Acura  2000.000    23845.000   35905.000 33690.963    44800.000
## 16           GMC  2000.000    26328.750   32760.000 32444.085    40040.000
## 17         Volvo  2000.000     2179.000   37775.000 29724.684    43950.000
## 18         Buick  2000.000    26205.000   31027.500 29034.189    35845.000
## 19        Toyota  2000.000    21785.000   29985.000 28531.518    36115.000
## 20    Volkswagen  2000.000    24232.500   27700.000 28238.380    31191.250
## 21          Ford  2000.000    21941.250   29557.500 28222.931    36507.500
## 22     Chevrolet  2000.000    18934.250   26520.000 27991.537    36148.750
## 23          Saab  2000.000     2122.000   34975.000 27879.807    40400.000
## 24        Nissan  2000.000    20960.000   28570.000 27502.974    36840.000
## 25      Chrysler  2000.000    22055.000   29185.000 26722.963    34955.000
## 26         Honda  2000.000    22210.000   26150.000 26608.884    32550.000
## 27           Kia  2000.000    17741.250   23820.000 25318.750    31492.500
## 28       Hyundai  2000.000    19175.000   23600.000 24926.263    30100.000
## 29        Subaru  2000.000    21995.000   25645.000 24240.674    29995.000
## 30         Dodge  2000.000    17158.750   24477.500 22626.794    30495.000
## 31          FIAT 15990.000    19995.000   21880.000 22206.017    24697.500
## 32    Mitsubishi  2000.000    18395.000   23495.000 21316.351    27495.000
## 33         Mazda  2000.000    16050.000   22517.500 20106.556    26391.250
## 34         Scion 12480.000    16471.250   19205.000 19932.500    23190.000
## 35       Pontiac  2000.000    14930.000   22500.000 19800.044    27700.000
## 36        Suzuki  2000.000    15099.000   17642.000 18026.415    23524.000
## 37    Oldsmobile  2000.000     2000.000    2179.000 12843.795    23103.750
## 38      Plymouth  2000.000     2000.000    2000.000  3296.873     2000.000
##    MSRP.Max.
## 1  72000.000
## 2  74995.000
## 3  68400.000
## 4  75010.000
## 5  74700.000
## 6  54550.000
## 7  71450.000
## 8  75200.000
## 9  74260.000
## 10 67050.000
## 11 72520.000
## 12 74100.000
## 13 75000.000
## 14 43130.000
## 15 65950.000
## 16 71665.000
## 17 65700.000
## 18 49625.000
## 19 65215.000
## 20 74600.000
## 21 68996.000
## 22 75195.000
## 23 51330.000
## 24 60490.000
## 25 49470.000
## 26 47070.000
## 27 61900.000
## 28 68750.000
## 29 39995.000
## 30 65945.000
## 31 29235.000
## 32 38995.000
## 33 44015.000
## 34 31090.000
## 35 37610.000
## 36 31749.000
## 37 36795.000
## 38 44625.000
cadi <- car[which(Make == "Cadillac"), ]
Q1.cadi <- quantile(cadi$MSRP, .25)
Q3.cadi <- quantile(cadi$MSRP, .75)
IQR.cadi <- IQR(cadi$MSRP)
cadi[which(cadi$MSRP < (Q1.cadi - 1.5*IQR.cadi)), "Year"]
##  [1] 1991 1992 1993 1990 1991 1992 1999 2000 2000 2000 2000 1994 1995 1996 1993
plot_ly(car, x = ~Make, y = ~MSRP, type="box", name = 'MSRP')

MSRP for different engine cylinders

Below is the summary table for different numbers of the engine cylinder. As we can see in the table, despite the 0- and 12-cylinder cars, the rest of the cars follow the rule the higher the number of engine cylinders is the higher the retail price. 0 cylinder indicates that the car is electric, which also makes sense since generally speaking, electric cars are more expensive than gas cars. But the weird thing here is the 12-cylinder cars, is has the lowest average retail price. Our guess is for those 12-cylinder cars, the production year is before 2000. And the second table proved our hypothesis, we can see that all 12-cylinder cars are produced before the year 2000.

aggregate(MSRP~Engine.Cylinders, data = car, summary)
##   Engine.Cylinders MSRP.Min. MSRP.1st Qu. MSRP.Median MSRP.Mean MSRP.3rd Qu.
## 1                0  25120.00     25560.00    39900.00  34511.92     42400.00
## 2                3   2000.00      2000.00    13495.00   9698.00     14907.50
## 3                4   2000.00     17845.00    23800.00  23981.28     30100.00
## 4                5   2000.00      2232.00    24950.00  21765.60     33390.00
## 5                6   2000.00     25105.00    33095.00  32491.45     41990.00
## 6                8   2000.00     30432.50    39350.00  38628.92     51835.00
## 7               10  68340.00     68340.00    68340.00  68340.00     68340.00
## 8               12   2960.00      3185.25     5240.50   5353.25      7121.50
##   MSRP.Max.
## 1  43600.00
## 2  18175.00
## 3  74995.00
## 4  57200.00
## 5  75200.00
## 6  75195.00
## 7  68340.00
## 8   8488.00
car[which(car$Engine.Cylinders == 12), ]
##              Make Year Engine.HP Engine.Cylinders highway.MPG city.mpg
## 671 Mercedes-Benz 1992       402               12          14       10
## 672 Mercedes-Benz 1993       389               12          15       11
## 673 Mercedes-Benz 1993       389               12          16       12
## 674 Mercedes-Benz 1993       389               12          15       11
## 761           BMW 1995       322               12          19       12
## 762           BMW 1995       372               12          18       10
## 763           BMW 1996       322               12          19       12
## 765           BMW 1997       322               12          19       12
##     Popularity MSRP
## 671        617 2960
## 672        617 3108
## 673        617 4623
## 674        617 3211
## 761       3916 5858
## 762       3916 8488
## 763       3916 6954
## 765       3916 7624
plot_ly(car, x = ~Engine.Cylinders, y = ~MSRP, type="box", name = 'MSRP')

Mean of MSRP for different years

As we mentioned earlier, we noticed that the price before and after the year 2000 has a huge gap. Therefore, to prove this hypothesis, we made a visualization to show the trend of the price. And below is the line graph of the year versus MSRP. As we expected, there is a huge gap before and after the year 2000.

year_of_MSRP_mean<-aggregate(MSRP~Year, data = car, FUN = mean)
plot_ly(year_of_MSRP_mean,x = ~Year, y = ~MSRP, type = "scatter", mode = "lines+markers", name = "Mean of MSRP for different years")

Sampling

There is a total of 5 sampling methods. Like putting everyone’s name into a hat, random sampling is similar to drawing out several names. However, it is usually much harder to do since it requires a complete list of all the elements in the population. Unlike random sampling, systematic sampling is usually easier to do since it eliminates the list of all the elements. This method, which is similar to lining everyone up, eliminates the list of all the elements and uses all the numbers in the population. Although convenience sampling is generally easier to do, it is probably the worst technique for gathering data. In convenience sampling, the data is obtained by running into the first people that the survey encounters. The goal of cluster sampling is to divide the population into groups, which are usually geographically dispersed. Each element in the group is randomly selected. Another type of sampling that involves separating the population into groups is stratified sampling. For instance, if the population is composed of males and females, then a sample is taken from each stratum. The goal of sampling is to find a method that is likely to yield a representative sample of the population. As we can see in the below graph, the four sampling methods we used can clearly represent the population distribution, they are extremely similar in shape.

library(sampling)
set.seed(544)
sample1 = car[sample(1:nrow(car), size = 500), ]

N <- nrow(car)
n <- 500
k <- ceiling(N/n)
set.seed(544)
r = sample(k, 1)
sample2 <- car[seq(r, by = k, length = n),]

pik <- inclusionprobabilities(car$Popularity,500)
sample3 <- car[UPsystematic(pik) != 0,]

car1 <- car[order(car$Year),]
freq <- table(car1$Year)
st.sizes <- 500 * freq / sum(freq)
st.sizes <- as.vector(t(st.sizes))
st.sizes <- st.sizes[st.sizes!=0]
st.4 <- strata(car1, stratanames = "Year", size = st.sizes, method = "srswor")
sample4 <- getdata(car1, st.4)

subplot(plot_ly(x = car$MSRP, type="histogram", name = "Frequency of MSRP"),
plot_ly(x = sample1$MSRP, type="histogram", name = "Simple random sampling without replacement"),
plot_ly(x = sample2$MSRP, type="histogram", name = "Systematic sampling"),
plot_ly(x = sample3$MSRP, type="histogram", name = "Systematic sampling with unequal probabilities(Popularity)"),
plot_ly(x = sample4$MSRP, type="histogram", name = "Stratified sampling(Year)"),nrows = 3)

Conclusion

As the linear model shows, the relationship between each feature and retail price is clear. As the year, engine horsepower and fuel efficiency go up, the price goes up. As engine cylinders and popularity go up the price goes down. The engine cylinders and popularity might cause some misunderstanding, engine cylinders have an inverse relationship to MSRP is because of the existence of the 12-cylinder engine as we mentioned earlier, the popularity has an inverse relationship is because the cheaper car is much more popular than those luxury cars. And there are 73% of the data can be presented by this model.

m <- lm(MSRP ~ Year + Engine.Cylinders + Engine.HP + Popularity + city.mpg)
summary(m)
## 
## Call:
## lm(formula = MSRP ~ Year + Engine.Cylinders + Engine.HP + Popularity + 
##     city.mpg)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -31714  -5364   -878   4817  42279 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.826e+06  2.998e+04 -60.893  < 2e-16 ***
## Year              9.099e+02  1.499e+01  60.703  < 2e-16 ***
## Engine.Cylinders -5.958e+02  1.033e+02  -5.766 8.35e-09 ***
## Engine.HP         1.234e+02  1.847e+00  66.801  < 2e-16 ***
## Popularity       -6.004e-01  5.706e-02 -10.521  < 2e-16 ***
## city.mpg          7.877e+01  1.684e+01   4.678 2.93e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8298 on 10148 degrees of freedom
## Multiple R-squared:  0.7281, Adjusted R-squared:  0.728 
## F-statistic:  5435 on 5 and 10148 DF,  p-value: < 2.2e-16